This is an R Markdown document designed to review Seurat and basic
analysis of single-cell RNA-seq (scRNA-seq) data. For this tutorial we
will use a few samples from GSE235326, from Kumar T,
Nee K, Wei R, He S et al. A spatially resolved single-cell genomic atlas
of the adult human breast. Nature 2023 Aug;620(7972):181-191. The
samples selected from this data set were generated with 10x Genomics
Chromium Single Cell 3’ v3.1 chemistry.
This tutorial reviews the steps of a standard analysis using the Seurat package.
If you want to know more about a function, you can use
?FUNCTION_NAME to examine its inputs, where
FUNCTION_NAME is the name of the function with the correct
capitalization.
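For example, to open the help page for setwd (used later in this tutorial):

```r
# opens the documentation, including the function's arguments
?setwd
# equivalent call
help("setwd")
```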
If you are working interactively, you can browse through the file system to find the data location and set a new working directory. In RStudio, use the Files pane in the side panel to walk through your home directory and set the working directory.
We will be using the following libraries. These should already be available
if you are using the server; however, you might need to install them.
Note that this tutorial was prepared using
R version 4.2.0 (2022-04-22) and Seurat 4.
If you need to install one, for example:
install.packages('harmony')
library(Seurat)
## Attaching SeuratObject
## Attaching sp
library(ggplot2)
library(patchwork)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
library(harmony)
## Loading required package: Rcpp
This will need to be changed to the specific location where you have placed the data. Specify where you are working and set the working directory.
your_working_dir <- "/Users/akcasasent/Desktop/Teaching/GSE235326/"
setwd(your_working_dir)
# primary filter
min_cells <- 0
min_features <- 100
# paper info
paper_dim <- 30
paper_res <- 0.2
# secondary filter
min_nFeature <- 200
max_nFeature <- 2500
min_nCount <- 500
max_nCount <- 20000
# after cluster based filtering (in this case)
max_mito <- 10
max_ribo <- 50
Before merging samples into one data set, I strongly suggest that you go through each sample individually to identify the clusters. This can also serve as a cross check to make sure that once you merge and integrate your samples, the clustering makes sense.
This is also a good time to look at “bad” clusters, which have only mito or ribo genes differentially expressed or highly expressed. However, note that you need to be careful with these clean-up steps.
Here, I am going to show the clean up after the merging/integrating for the sake of time.
However, I want it noted that each of these steps should actually be run on each individual sample for the best results. This should also reduce the number of cells you drop overall.
In science, we usually have multiple samples that we want to study together, rather than looking at a single sample (as we did in the previous Rmd file). Let’s discuss how we can handle multiple samples, including merging and integrating them.
Note: this setup is not very helpful if you can only examine a single sample.
So, how do you handle multiple samples?
You can merge samples directly while reading them in, using for loops or apply functions.
Before merging, you will often need to create the sample object(s), and you might want to add annotations based on sample information from a sample sheet so that the format is the same across all samples. (Seurat 5 is better suited to this, but it is possible in Seurat 4 as well.)
Here we show an example with 4 samples, written so that it can be expanded to many more samples. In this case we use 2 samples that were processed with a 3-hour digestion and 2 samples with an overnight digestion.
Set the directories we will use to read in information and files.
input_dir <- paste0(your_working_dir, "input/")
data_dir <- paste0(your_working_dir, "data/")
You will want a list of the samples of interest. In some cases, this might be the file names or a label that can be cross-referenced.
# gets the list of samples of interest
# in this case by Accession Number
# this could also be done using files read as well.
# 4 samples 2 left, 2 right, 2 hr and 2 overnight
#sam_names_list <- c("GSM7500363","GSM7500362", "GSM7500360", "GSM7500359")
# 4 samples 2 from over night 2 from 3 hour, all from contra
sam_names_list <- c("GSM7500495", "GSM7500499", "GSM7500500", "GSM7500512")
# extra
# "GSM7500496"
# "GSM7500513"
Sample information can be read in from the input files.
# gets the list sample information file
sample_info <- readxl::read_excel(paste0(input_dir,"Sample_Info.xlsx"))
sample_info <- as.data.frame(sample_info)
This allows you to select only a few samples – basically the ones from the list provided earlier.
# selects only the samples of interest
sel_sample_info <- sample_info[sample_info$Accession %in% sam_names_list,]
sel_sample_info
## Accession Sample_Name Patient_ID Replicate Side Digestion
## 137 GSM7500495 hbca_c44 P35 replicate_1 contra overnight_digestion
## 141 GSM7500499 hbca_c48 P43 replicate_1 contra 3hr_digestion
## 142 GSM7500500 hbca_c49 P43 replicate_1 contra overnight_digestion
## 154 GSM7500512 hbca_c165 P124 replicate_1 contra 3hr_digestion
Functions used in the loop:
for - a loop useful when the number of iterations is
known or fixed. It iterates through each individual item in a
list.
which - returns the positions of elements in a logical
vector that are TRUE.
get - retrieves an object using the string name of
the object from a specific environment. The default is
your working environment.
assign - takes x as the string name, and value as the
object that will be assigned that name. This basically creates a new
object identical to value. It works great in loops to create
specifically named objects.
sapply - another version of apply, also called
simplified apply. It only requires X, a vector,
list, or object. It returns a vector or a matrix.
Other good points for sapply: * If every element in the
list returned by the function has length 1, a vector is returned. * If
every element in the list returned by the function is a vector of the
same length (>1), a matrix is returned. * If lengths vary,
simplification is not possible, and a list is returned.
length - returns the number of
elements in a list or vector.
break - a statement essential for controlling
the flow of execution in loops. It allows an immediate exit from a
loop, even if the loop’s condition is still true.
is.null - returns a logical: TRUE if the object is NULL, and
FALSE if it is something other than NULL.
! - logical negation.
rm - removes objects. It is
good practice to clean up after a loop, especially temp objects such as
indexes that would otherwise clutter your environment.
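To make these concrete, here is a small base-R sketch (toy values, not part of the tutorial data) showing how which, assign, get, and sapply behave:

```r
# which() returns the positions of TRUE elements in a logical vector
which(c(FALSE, TRUE, TRUE))          # 2 3

# assign() creates an object from a string name; get() retrieves it by name
for (nm in c("a", "b")) {
  assign(x = paste0(nm, "_obj"), value = toupper(nm))
}
get("a_obj")                          # "A"

# sapply() simplifies to a vector when every result has length 1 ...
sapply(1:3, function(x) x^2)          # 1 4 9
# ... and falls back to a list when the result lengths vary
sapply(1:3, function(x) seq_len(x))

# rm() cleans up the temporary objects created above
rm(a_obj, b_obj, nm)
```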
# creates a null object which will hold the names of all the seurat objects of interest
# note that ideally you would create a processing Rmd for each individual sample; I would suggest you do so.
assign_name_list <- NULL
# so you don't need to get the path info each time
sample_paths <- list.files(data_dir, full.names = TRUE)
# loop through the samples of interest
for(sample_name in sam_names_list)
{
#path for each sample if they were in unique folders
#sample_path <- file.path(general_path,sample_name)
#gets the index that the sample path for specific
# named vector where the name is the sample location
sample_index <- which(sapply(sample_paths, function(x)
{grep(pattern=sample_name, x)}) >0)
# this part is code to check if you have multiple matches or no matches and breaks out of the loop if this occurs.
if(length(sample_index) > 1)
{
print("multiple matches")
break
}
if(length(sample_index) == 0)
{
print("no matches")
break
}
# this uses the min_cells and min_features from earlier.
# these can also be set to 0 if you want to do filtering later.
sample_mtx <- Read10X_h5(names(sample_index))
sample_seurat <- CreateSeuratObject(sample_mtx, min.cells = min_cells,
min.features = min_features,
project = sample_name)
# calculate the mito and ribo percentage
sample_seurat[["percent.mt"]] <- PercentageFeatureSet(sample_seurat, pattern = "^MT-")
sample_seurat[["percent.rb"]] <- PercentageFeatureSet(sample_seurat, pattern = "^RP[SL]")
# for each column in the sample info file add it to meta data for the sample
for(metaCol in colnames(sel_sample_info))
{
# row, col
# WARNING Hardcoding below
# Accession hard coded for getting info
sample_seurat@meta.data[[metaCol]] <- sel_sample_info[which(sel_sample_info$Accession==sample_name), which(colnames(sel_sample_info) == metaCol)]
}
# manually / Hardcoding this would look something like this
#sample_seurat_sub@meta.data[["Treatment"]] <- sel_sample_info$Treatment[which(sel_sample_info$Sample==sample_name)]
#filter samples
sample_seurat <- subset(sample_seurat, subset = nFeature_RNA > min_nFeature &
nFeature_RNA < max_nFeature &
nCount_RNA > min_nCount &
nCount_RNA < max_nCount )
# assigns to a new alias
# this is a good way to make objects in loops
print(paste0(sample_name,"_seurat"))
assign_name <- paste0(sample_name,"_seurat")
assign(x=assign_name, value = sample_seurat)
#compiles the list of all names
if(!is.null(assign_name_list)){
assign_name_list <- c(assign_name_list,assign_name)
}
if(is.null(assign_name_list)){
assign_name_list <- assign_name
}
}
## [1] "GSM7500495_seurat"
## [1] "GSM7500499_seurat"
## [1] "GSM7500500_seurat"
## [1] "GSM7500512_seurat"
# clean up loop info
rm(sample_index, sample_mtx, sample_seurat, sample_name, assign_name)
When you merge the samples, you simply pass the list of sample objects you created. The merged object will also include all the metadata items you previously added.
Note that while merge does not integrate, it lets you see whether you need to integrate, and possibly over what variable.
# note merge DOES not integrate
hbca.merged <- merge(get(assign_name_list[1]), y = sapply(assign_name_list[-1],function(x) get(x)),
add.cell.ids = sam_names_list,
project = "HBCA")
#clean up data
rm(list =assign_name_list)
rm(assign_name_list)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 3686391 196.9 14087370 752.4 NA 14582359 778.8
## Vcells 146059771 1114.4 447726858 3415.9 32768 443853885 3386.4
After the samples are merged, you have to scale the data, run dimensionality reduction, find neighbors and clusters, and then create a UMAP from that data.
This takes a little bit of time.
hbca.merged <- ScaleData(hbca.merged)
## Centering and scaling data matrix
hbca.merged <- FindVariableFeatures(hbca.merged)
hbca.merged <- RunPCA(hbca.merged, features = VariableFeatures(object = hbca.merged))
## PC_ 1
## Positive: SOD2, GAPDH, FTH1, C1S, COL6A2, BNIP3, C1R, MMP2, CST3, BNIP3L
## MT2A, NAMPT, NNMT, CEBPB, IL1R1, PID1, GSN, CTSL, IRAK3, PDGFRA
## FAM162A, VIM, GPNMB, NFKBIA, TNFAIP6, PDPN, PLPP3, COL6A3, CLMP, PRRX1
## Negative: PTPRC, IL7R, CD3D, CD2, LEPROTL1, CD3E, CD48, FYB1, ARHGDIB, GIMAP7
## CXCR4, SPOCK2, TRAC, TRBC2, CLEC2D, CD7, CD3G, CD52, EVI2B, ITK
## LTB, IKZF3, CYTIP, TRAT1, CD37, EZR, KIAA1551, SMAP2, TRBC1, GIMAP4
## PC_ 2
## Positive: LAPTM5, TYROBP, FCER1G, SAMSN1, SRGN, TLR2, TNFRSF1B, MS4A7, RNASET2, CYBA
## GNA15, FTL, EPB41L3, CD74, FCGR2A, CTSB, SLC11A1, LCP1, CTSS, C1QC
## CD68, SPI1, HLA-DQB1, ITGB2, SLC16A10, CD163, BCAT1, PIK3AP1, HLA-DRA, C1QB
## Negative: DCN, C1R, COL6A2, C1S, GSN, LUM, RARRES2, MFAP4, FBLN1, SERPINF1
## SOD3, GEM, NNMT, GPC3, FXYD1, FSTL1, COL6A1, CCDC80, MMP2, PDGFRL
## RND3, IGFBP4, CRISPLD2, SLIT3, GPRC5A, FOSB, PCOLCE, COL14A1, TIMP3, COL6A3
## PC_ 3
## Positive: C1S, MMP2, COL6A2, PDGFRA, C1R, TNFAIP6, COL6A3, DCN, CFD, FBLN1
## GPC3, SERPINF1, C3, MEDAG, THBS2, CLMP, SRPX, PRRX1, PTPRC, PTGES
## PID1, LUM, CYP1B1, PDPN, TWIST2, PCOLCE, LPAR1, DPT, RARRES2, IL7R
## Negative: ACTG1, TM4SF1, ACTB, SPARCL1, NEDD9, CD9, CAV2, ADGRL4, PLS3, BACE2
## ESAM, PECAM1, JAG1, MAST4, TCF4, ACTN1, CDC42EP3, PNP, TPM1, CALCRL
## SPINT2, MGST2, MGLL, CD93, PLVAP, CRIP2, ADAMTS9, FLNB, TM4SF18, PCAT19
## PC_ 4
## Positive: TPM1, SPINT2, DST, ACTA2, TAGLN, CSRP1, KRT17, KRT8, TNFRSF12A, KRT5
## PAWR, CNN1, PALLD, ATP1B1, TPM2, FHL2, PPP1R12B, TACSTD2, FOSB, KRT7
## KRT14, NEDD4L, SFN, LAMA3, COL17A1, ITGA2, DSP, CDH1, SERTAD1, FXYD3
## Negative: ADGRL4, TCF4, SERPINE1, ESAM, PECAM1, CALCRL, GNG11, PLVAP, SPRY1, ADAMTS9
## PCAT19, CDH5, EMCN, MGST2, AFAP1L1, BCL6B, MYCT1, PXDN, CLDN5, SOX18
## CD93, ECSCR, MECOM, TIE1, PREX2, SPARCL1, COL4A1, DOCK9, ANGPT2, ADGRF5
## PC_ 5
## Positive: BNIP3, BNIP3L, GAPDH, SOD2, FTH1, HILPDA, WIPI1, PTGES, FAM162A, PDIA6
## SLC7A2, MGST1, PLIN2, TBX3, THBS2, PID1, CLMP, MANF, SDF2L1, TNFAIP6
## ARSG, CEBPD, SPINT2, IRAK3, PDPN, PLOD2, CA12, PDIA4, DDIT4, HSP90B1
## Negative: TMSB4X, DCN, FOSB, JUNB, KLF4, TIMP2, TIMP3, SELENOP, FXYD1, VIM
## ID3, CXCL12, LMNA, MXRA8, MFAP4, SERTAD1, ITM2A, EMP1, PDGFRL, OMD
## HSP90AA1, SOCS3, S100A4, NR4A1, GSN, OLFML3, LUM, EGR1, HTRA1, NFATC2
hbca.merged <- FindNeighbors(hbca.merged, dims = 1:paper_dim)
## Computing nearest neighbor graph
## Computing SNN
hbca.merged <- FindClusters(hbca.merged, resolution = paper_res)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 34835
## Number of edges: 1279433
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9688
## Number of communities: 15
## Elapsed time: 5 seconds
hbca.merged <- RunUMAP(hbca.merged, dims = 1:paper_dim, spread=1)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 18:59:29 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:59:29 Read 34835 rows and found 30 numeric columns
## 18:59:29 Using Annoy for neighbor search, n_neighbors = 30
## 18:59:29 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:59:32 Writing NN index file to temp file /var/folders/3_/lw61drbn68xglctt7ljrsmrcd_ghr5/T//Rtmpsdg3ET/file93064883646a
## 18:59:32 Searching Annoy index using 1 thread, search_k = 3000
## 18:59:41 Annoy recall = 100%
## 18:59:43 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 18:59:44 Initializing from normalized Laplacian + noise (using irlba)
## 18:59:50 Commencing optimization for 200 epochs, with 1515572 positive edges
## 19:00:09 Optimization finished
Plot the UMAP, grouping by the clusters and by different metadata variables.
DimPlot(hbca.merged, reduction = "umap", group.by=paste0("RNA_snn_res.",paper_res))
DimPlot(hbca.merged, reduction = "umap", group.by="Accession")
DimPlot(hbca.merged, reduction = "umap", group.by="Patient_ID")
Some variables are uninformative; in this case, Side is just the contralateral side for every selected sample.
DimPlot(hbca.merged, reduction = "umap", group.by="Side")
Here we can clearly see that Digestion likely has the largest batch effect and is therefore the best variable to integrate over.
DimPlot(hbca.merged, reduction = "umap", group.by="Digestion")
DimPlot(hbca.merged, reduction = "umap", group.by="Patient_ID", split.by = "Digestion")
SCTransform should be performed on each sample individually before integration, but the samples also need to share the same model, which can be confusing.
One of the major issues you will face is that SCTransform takes a lot of memory, so some systems will have trouble running it with many samples. We are only using 4 samples in order to be able to run on the machines provided. If there are issues, we can reduce the number of samples down to 2 and run it on just those 2 samples.
However, this can result in errors such as “SCT assay is comprised of multiple SCT models”.
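For reference, the per-sample SCT integration workflow in Seurat 4 looks roughly like the sketch below. This chunk is not run here; it assumes a hypothetical list sample.list of per-sample Seurat objects (like the per-sample objects created in the loop above) and follows the standard SCTransform integration steps.

```r
# SCTransform each sample separately so each gets its own model
sample.list <- lapply(sample.list, SCTransform, vars.to.regress = "percent.mt")
# pick shared variable features and prepare the SCT assays for integration
sct.features <- SelectIntegrationFeatures(object.list = sample.list, nfeatures = 3000)
sample.list <- PrepSCTIntegration(object.list = sample.list,
                                  anchor.features = sct.features)
# find anchors and integrate using the SCT normalization method
sct.anchors <- FindIntegrationAnchors(object.list = sample.list,
                                      normalization.method = "SCT",
                                      anchor.features = sct.features)
hbca.sct <- IntegrateData(anchorset = sct.anchors, normalization.method = "SCT")
```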
Functions:
Sys.time - provides the system time. It allows you to
figure out how long items took to run.
%>% - the pipe, from the dplyr library. It
pushes the object on the left into the next function, which makes for a
clean pipeline.
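As a minimal base-R illustration (toy numbers) of timing a piped pipeline:

```r
library(dplyr)  # provides %>%

start_time <- Sys.time()
# each %>% passes the result on as the first argument of the next call
total <- c(4, 9, 16) %>%
  sqrt() %>%
  sum()
total                    # 9 (2 + 3 + 4)
Sys.time() - start_time  # how long the pipeline took
```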
Sys.time()
## [1] "2024-03-25 19:00:19 CDT"
hbca.merged <- hbca.merged %>%
NormalizeData() %>%
FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
ScaleData() %>%
SCTransform(vars.to.regress = c("percent.mt"))
## Centering and scaling data matrix
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 23239 by 34835
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 5000 cells
## There are 3 estimated thetas smaller than 1e-07 - will be set to 1e-07
## Found 90 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 23239 genes
## Computing corrected count matrix for 23239 genes
## Calculating gene attributes
## Wall clock passed: Time difference of 2.868889 mins
## Determine variable features
## Place corrected count matrix in counts slot
## Regressing out percent.mt
## Centering data matrix
## Set default assay to SCT
Sys.time()
## [1] "2024-03-25 19:04:03 CDT"
CCA is actually one of the strongest integration methods.
In general, I do not suggest that people use CCA because it often over-integrates or over-smooths the data, causing the data to look like one glob, which prevents real differences from being measured. This occurs because CCA’s underlying assumption is that the samples/batches should be more similar than different.
We are specifically going to correct the batches based on Digestion.
To do this we need to create objects or layers for the batches. This is something that changes between Seurat 4 and Seurat 5, so make sure you know which version you are working with.
gc - stands for garbage collection. It is helpful for
freeing memory after removing large objects.
SplitObject - part of the Seurat package. This splits
the object into a list.
Note that how the split works changes between Seurat 4 and Seurat 5.
# Clear out memory
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 3981827 212.7 11269896 601.9 NA 14582359 778.8
## Vcells 1698963350 12962.1 4041898167 30837.3 32768 4041896955 30837.3
# splits into different samples or groups which you want to integrate over
# similar to how we did the SCTransform
# this is based on the group you are integrating over.
hbca.list <- SplitObject(hbca.merged, split.by = "Digestion")
Go through the list of objects and make sure to normalize each of them.
lapply - list apply. This returns the results in a list
format.
hbca.list <- lapply(X = hbca.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
Select features that are repeatedly variable across datasets for integration, then scale and run PCA on each dataset using these features.
SelectIntegrationFeatures - takes an object.list and
returns features to use when integrating multiple datasets. It ranks
features by the number of datasets in which they are deemed variable,
breaking ties by the median variable feature rank across datasets, then
returns the top-scoring features by this ranking.
hbca.features <- SelectIntegrationFeatures(object.list = hbca.list)
hbca.list <- lapply(X = hbca.list, FUN = function(x) {
x <- ScaleData(x, features = hbca.features, verbose = FALSE)
x <- RunPCA(x, features = hbca.features, verbose = FALSE)
})
The next step in CCA integration is to find anchors. Anchors are pairs of similar cells across datasets that are used to align them; the anchor features themselves are genes that behave consistently across datasets (conceptually similar to housekeeping genes). This step may take a while depending on your sample size and number of clusters: the larger the dataset, the more “batches”, and the more clusters, the longer it will take.
Note: this is one of the steps that you will really want to run as a script on an HPC (high-performance cluster), and make sure you save the results.
FindIntegrationAnchors - finds a set of anchors between a
list of Seurat objects. These anchors can later be used to integrate the
objects using the IntegrateData function. The reduction matters, as it
determines how the anchors are defined.
Sys.time()
## [1] "2024-03-25 19:04:49 CDT"
hbca.anchors <- FindIntegrationAnchors(object.list = hbca.list, anchor.features = hbca.features, reduction = "cca")
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 21867 anchors
## Filtering anchors
## Retained 8340 anchors
Sys.time()
## [1] "2024-03-25 19:24:30 CDT"
Then we integrate using CCA. This also can take a while.
Sys.time()
## [1] "2024-03-25 19:24:30 CDT"
# this command creates an 'integrated' data assay
hbca.cca <- IntegrateData(anchorset = hbca.anchors)
## Merging dataset 2 into 1
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
Sys.time()
## [1] "2024-03-25 19:25:29 CDT"
To make sure you are working with the integrated data, you need to change the default assay.
DefaultAssay - tells the different functions in
Seurat which assay of the data to pull first. The original default was
“RNA”; here we reset it to use the integrated assay. It can also be
set to other assays.
# you need to know what data you are working with, so you tell Seurat explicitly
# Note that the original, unmodified data still resides in the 'RNA' assay -- if you use that, you are using the raw data
# the assays won't behave the same
# so we set the default to the assay we want to work with.
DefaultAssay(hbca.cca) <- "integrated"
After you have integrated the data, you need to run the standard workflow before you can look at the data in more depth.
Sys.time()
## [1] "2024-03-25 19:25:29 CDT"
# note that we can write it this way, which takes up a lot less space, sending each result to the next step
# however this way can be harder to debug
hbca.cca <- hbca.cca %>%
ScaleData() %>%
RunPCA(npcs = paper_dim, verbose = FALSE) %>%
RunUMAP(reduction = "pca", dims = 1:paper_dim) %>%
FindNeighbors(reduction = "pca", dims = 1:paper_dim)
## Centering and scaling data matrix
## 19:25:45 UMAP embedding parameters a = 0.9922 b = 1.112
## 19:25:45 Read 34835 rows and found 30 numeric columns
## 19:25:45 Using Annoy for neighbor search, n_neighbors = 30
## 19:25:45 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 19:25:49 Writing NN index file to temp file /var/folders/3_/lw61drbn68xglctt7ljrsmrcd_ghr5/T//Rtmpsdg3ET/file930657ed6c04
## 19:25:49 Searching Annoy index using 1 thread, search_k = 3000
## 19:25:57 Annoy recall = 100%
## 19:25:58 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:25:59 Initializing from normalized Laplacian + noise (using irlba)
## 19:26:02 Commencing optimization for 200 epochs, with 1524190 positive edges
## 19:26:21 Optimization finished
## Computing nearest neighbor graph
## Computing SNN
# this can also be written this way
# Run the standard workflow for visualization and clustering
# Note that you start with scaling the data
# hbca.cca <- ScaleData(hbca.cca, verbose = FALSE)
# hbca.cca <- RunPCA(hbca.cca, npcs = paper_dim, verbose = FALSE)
# hbca.cca <- RunUMAP(hbca.cca, reduction = "pca", dims = 1:paper_dim)
# hbca.cca <- FindNeighbors(hbca.cca,reduction = "pca", dims = 1:paper_dim)
Sys.time()
## [1] "2024-03-25 19:26:29 CDT"
plot_by_side_one <- DimPlot(hbca.cca, reduction = "umap", group.by = "Digestion")
plot_by_side_one
plot_by_side_split <- DimPlot(hbca.cca, reduction = "umap", group.by = "Patient_ID",
split.by = "Digestion")
plot_by_side_split
Note: one thing to keep in mind is that the Seurat package is always evolving. In Seurat 5 you can keep everything together in one object, but in Seurat 4 we had to separate it out.
Take a look at the newer vignettes to get an idea: https://satijalab.org/seurat/articles/integration_introduction.html
It would look something like this. Note that this code chunk is neither run nor checked, due to eval = FALSE in the code chunk header.
hbca.merged <- IntegrateLayers(object = hbca.merged, method = CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca",
verbose = FALSE)
# re-join layers after integration
hbca.merged[["RNA"]] <- JoinLayers(hbca.merged[["RNA"]])
hbca.merged <- FindNeighbors(hbca.merged, reduction = "integrated.cca", dims = 1:paper_dim)
hbca.merged <- FindClusters(hbca.merged, resolution = 1)
There are actually a lot of different integration methods to consider. When you are working with samples that have very high diversity, you will want to think carefully about which integration method makes the most sense for your samples.
A major issue with integration is the concern of over-correcting, while other methods do not appear to correct enough. There is a pipeline from Dr. Linghua Wang’s lab that provides a good example and examines the “batches” across the different integration methods.
Another integration tool you could consider is Harmony, which is often used across samples or across digestions. Note that Harmony is much faster than CCA, which means you can integrate over a larger number of items more quickly. It also does less fitting, which means it is less likely to over-correct.
library(harmony)
#library(SeuratWrappers)
Sys.time()
## [1] "2024-03-25 19:26:31 CDT"
# this is if you are integrating by sample, in this case by Accession
#hbca.harmony <- SCTransform(hbca.merged, vars.to.regress = 'Accession')
#hbca.harmony <- SCTransform(hbca.merged, vars.to.regress = 'Digestion')
hbca.harmony <- hbca.merged %>%
RunPCA() %>%
RunHarmony(group.by.vars='Digestion', assay.use='SCT') %>%
RunUMAP(reduction = "harmony", dims = 1:paper_dim, reduction.name = 'harmonyumap', reduction.key = 'harmonyumap_') %>%
FindNeighbors(reduction = "harmony", dims = 1:paper_dim) %>%
FindClusters(resolution = paper_res)
## PC_ 1
## Positive: CXCL8, CFD, FTH1, DCN, TNFAIP6, APOD, CXCL3, PLIN2, FTL, TIMP1
## SOD2, IL6, MT2A, C11orf96, CXCL1, LUM, CTSL, HILPDA, SERPINE1, CXCL2
## GSN, THBS2, MMP3, G0S2, MEDAG, AKR1C1, IER3, HMOX1, SERPINE2, IGFBP6
## Negative: KRT17, KRT14, IL7R, DST, SAA1, TAGLN, MTRNR2L8, KRT5, CXCR4, PTPRC
## ACTA2, TACSTD2, CD2, KRT8, TPM1, KRT7, KLRB1, EZR, STK4, SARAF
## BTG1, LTB, SFN, FYB1, CRYAB, TRBC2, CD3D, SRGN, CCL5, LEPROTL1
## PC_ 2
## Positive: KRT17, KRT14, TAGLN, DST, DCN, SAA1, MT2A, APOD, KRT5, CFD
## ACTA2, MT1X, CYR61, FOSB, FOS, CTGF, TPM1, MGP, LUM, CCL2
## EGR1, CEBPD, GSN, CRYAB, CXCL14, TPM2, MEG3, CALD1, MYLK, FBXO32
## Negative: HLA-DRA, SRGN, IL7R, CD74, CXCL8, FTL, FCER1G, HLA-DRB1, CD163, TYROBP
## CXCR4, HLA-DQA1, PTPRC, HLA-DPB1, C15orf48, HLA-DPA1, S100A9, C1QB, C1QA, FTH1
## IL1B, CTSB, MS4A7, HLA-DQB1, C1QC, MTRNR2L8, BTG1, RGS1, CCL4, RNASE1
## PC_ 3
## Positive: DCN, APOD, CFD, IL7R, LUM, MGP, MTRNR2L8, GSN, PTPRC, TNFAIP6
## IGFBP6, MEG3, FBLN1, CXCR4, SFRP2, COL1A2, TMSB4X, COL1A1, CCDC80, OGN
## CD2, RPS27, KLRB1, CTSK, DPT, MFAP4, RARRES2, RPS12, RPS29, MFAP5
## Negative: KRT17, KRT14, CXCL8, SAA1, DST, FTL, FTH1, TAGLN, KRT5, FCER1G
## CD163, HLA-DRA, C15orf48, S100A9, TYROBP, MT2A, ACTA2, MT1X, C1QB, THBS1
## CYR61, CTSB, IL1B, C1QA, RNASE1, C1QC, MS4A7, CD74, CCL3, IER3
## PC_ 4
## Positive: TM4SF1, SERPINE1, ANGPT2, SPARCL1, SPRY1, ADAMTS9, CLDN5, C2CD4B, STC1, SOX18
## FABP4, CD93, ADGRL4, PECAM1, PGF, DEPP1, EMCN, TCF4, CALCRL, AKAP12
## MGST2, PLVAP, TM4SF18, JAG1, GNG11, MCAM, ACKR1, IGFBP7, CAV1, ADGRF5
## Negative: DCN, KRT17, KRT14, CFD, DST, TAGLN, APOD, SAA1, HLA-DRA, FCER1G
## CD163, IL7R, KRT5, LUM, TYROBP, FTL, MT1X, S100A9, FOSB, ACTA2
## C1QB, MEG3, SRGN, C1QA, ANXA1, CXCL14, MGP, C1QC, THBS1, FTH1
## PC_ 5
## Positive: DCN, HLA-DRA, APOD, FOS, CTGF, CD74, FOSB, HLA-DRB1, SPARCL1, HSPA1A
## MGP, LUM, GSN, TM4SF1, HLA-DPA1, JUNB, EMP1, GADD45B, HLA-DPB1, CCL2
## ID3, CD163, HLA-DQA1, EGR1, SOCS3, MEG3, TIMP3, HLA-DRB5, C1QB, TYROBP
## Negative: TNFAIP6, FTH1, MT2A, CXCL8, IL6, MMP3, PLIN2, C11orf96, SOD2, HILPDA
## SAA1, SERPINE2, CXCL1, KRT17, THBS2, AKR1C1, TIMP1, MEDAG, MGST1, MT1X
## KRT14, CEBPD, CLEC2B, CXCL3, AKR1C2, IL7R, SLC7A2, CYP1B1, RSPO3, LIF
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony converged after 2 iterations
## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.SCT.harmony; see ?make.names for more details
## on syntax validity
## 19:27:29 UMAP embedding parameters a = 0.9922 b = 1.112
## 19:27:29 Read 34835 rows and found 30 numeric columns
## 19:27:29 Using Annoy for neighbor search, n_neighbors = 30
## 19:27:29 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 19:27:33 Writing NN index file to temp file /var/folders/3_/lw61drbn68xglctt7ljrsmrcd_ghr5/T//Rtmpsdg3ET/file93064a63fe5c
## 19:27:33 Searching Annoy index using 1 thread, search_k = 3000
## 19:27:42 Annoy recall = 100%
## 19:27:43 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:27:44 Initializing from normalized Laplacian + noise (using irlba)
## 19:27:47 Commencing optimization for 200 epochs, with 1510364 positive edges
## 19:28:05 Optimization finished
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 34835
## Number of edges: 1235963
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9631
## Number of communities: 14
## Elapsed time: 5 seconds
hbca.harmony
## An object of class Seurat
## 56777 features across 34835 samples within 2 assays
## Active assay: SCT (23239 features, 3000 variable features)
## 1 other assay present: RNA
## 4 dimensional reductions calculated: pca, umap, harmony, harmonyumap
#hbca.harmony <- RunPCA(hbca.harmony)
#hbca.harmony <- RunHarmony(hbca.harmony, group.by.vars='Digestion', assay.use='SCT')
#hbca.harmony <- RunUMAP(hbca.harmony, reduction = "harmony", dims = 1:paper_dim, reduction.name = 'harmonyumap', reduction.key = 'harmonyumap_')
#hbca.harmony <- FindNeighbors(hbca.harmony, reduction = "harmony", dims = 1:paper_dim)
#hbca.harmony <- FindClusters(hbca.harmony, resolution = paper_res)
#hbca.harmony
Sys.time()
## [1] "2024-03-25 19:28:19 CDT"
plot_by_side_harmony <- DimPlot(hbca.harmony, reduction = "harmonyumap", group.by = "Digestion")
plot_by_side_harmony
plot_by_side_harmony <- DimPlot(hbca.harmony, reduction = "harmonyumap", group.by = "Digestion", split.by = "Digestion")
plot_by_side_harmony
## Post-Integration Filtering (after Harmony)
Note that I strongly suggest most of the filtering take place sample by sample. However, here I want to walk through some of the clean-up you might do, using "overclustering" to find outliers and drop a cluster. I am partly doing this on the integrated data set to give you an idea of the process, and because pre-cleaning the samples before integration is fairly interactive. Whatever you do needs to be recorded and reported, but it is often difficult for people to maintain records; this is why using an RMD or another method of recording all your pre-processing is needed to make the analysis make sense.
We previously filtered the data based on metrics like mitochondrial percentage, number of counts, and similar items.
Another major way cleaning can be done is by "overclustering", i.e., clustering at a high resolution and removing "bad" clusters. These are clusters that are outliers for different reasons.
For example, a cluster may contain cells from only one patient sample, which might be a reason not to trust it. This rarely happens after integration because of how integration tries to force the samples to be similar. The type of integration you use can cause more or less smoothing.
Systematically, we can look for clusters to remove based on what are most likely dead cells, i.e., those with a higher mito or ribo percentage.
Here, we will go through the process of overclustering and checking which clusters one might want to remove. Note that we "overcluster" by using a very high resolution during this step.
hbca.harmony <- FindClusters(hbca.harmony, resolution = 1.2)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 34835
## Number of edges: 1235963
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8978
## Number of communities: 32
## Elapsed time: 6 seconds
harmony_filter_clusters_umap <- DimPlot(hbca.harmony, reduction = "harmonyumap", group.by = "SCT_snn_res.1.2", label = TRUE)
harmony_filter_clusters_umap
Which clusters have really high mito percentage?
What biological reason might a cluster have high mito score?
What technical reason might cause a cluster to have a higher mito score?
When should you remove the cluster?
The last one is the hardest. Consider whether the cluster is mostly just mito genes and whether there is a biological reason for it. Remember that stress can be a biological reason; however, you might have technical information that helps you decide as well.
VlnPlot(hbca.harmony, features = c("percent.mt"))
## Ribo
Which clusters have really high ribo percentage?
What biological reason might a cluster have high ribo score?
What technical reason might cause a cluster to have a higher ribo score?
When should you remove the cluster?
These are similar questions to the previous ones, but also consider whether you are seeing both high ribo and high mito scores.
VlnPlot(hbca.harmony, features = c("percent.rb"))
You can also filter based on statistical criteria. This is what I suggest you do when you are actually filtering.
What would this math look like?
You can use either median-based (non-parametric) or mean-based (parametric) methods; just know which you are using. If you have normalized as here and have not filtered on a metric before, you can use parametric methods. If you have already filtered on these metrics, you need to switch to non-parametric methods and consider whether you are over-filtering.
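As a toy illustration (the values below are made up, not from this data set), here is how a parametric upper cutoff (mean + 2 * sd) compares with a non-parametric one (median + 2 * MAD) when a single extreme cell is present:

```r
# Toy QC values for one metric (e.g., percent.mt); the 35 is an extreme cell.
x <- c(4, 5, 5, 6, 6, 7, 8, 35)

# Parametric upper cutoff: both the mean and the sd are inflated by the outlier.
param_cut <- mean(x) + 2 * sd(x)

# Non-parametric upper cutoff: the median and MAD are robust to the outlier.
nonparam_cut <- median(x) + 2 * mad(x)

param_cut     # dragged up by the extreme value
nonparam_cut  # stays close to the bulk of the data
```

The robust cutoff stays near the bulk of the distribution, which is why the non-parametric version is safer on data that has already been filtered.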
An example test would be a Wilcoxon test between groups; the best approach is to test each group against all other groups.
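A minimal sketch of that one-versus-rest test, on simulated metadata (the cluster labels and percent.mt values here are invented, not taken from the object above):

```r
# Simulated per-cell metadata: cluster "2" has a much higher mito percentage.
set.seed(42)
meta <- data.frame(
  cluster    = rep(c("0", "1", "2"), each = 100),
  percent.mt = c(rnorm(100, mean = 5,  sd = 1),
                 rnorm(100, mean = 6,  sd = 1),
                 rnorm(100, mean = 25, sd = 3))
)

# One-versus-rest Wilcoxon test: cluster "2" against all other clusters.
in_cluster <- meta$cluster == "2"
res <- wilcox.test(meta$percent.mt[in_cluster], meta$percent.mt[!in_cluster])
res$p.value  # a tiny p-value flags cluster "2" as a mito outlier
```

In practice you would loop this over every cluster in the metadata and correct the p-values for multiple testing (e.g., with `p.adjust`).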
Alternatively, you can view this as a summary table.
Here we summarize by cluster and get the mean and sd of the different metrics. Note that using mean +/- 2 * sd for any of these, you do not see clusters that need to be removed.
hbca.cluster.summary <- hbca.harmony@meta.data %>%
group_by(SCT_snn_res.1.2) %>%
summarise(mean_count = mean(nCount_RNA, na.rm = TRUE),
sd_count = sd(nCount_RNA, na.rm = TRUE),
mean_feature = mean(nFeature_RNA, na.rm = TRUE),
sd_feature = sd(nFeature_RNA, na.rm = TRUE),
mean_mt = mean(percent.mt, na.rm = TRUE),
sd_mt = sd(percent.mt, na.rm = TRUE),
mean_rb = mean(percent.rb, na.rm = TRUE),
sd_rb = sd(percent.rb, na.rm = TRUE),
mean_mtrb = mean((percent.mt + percent.rb), na.rm = TRUE),
sd_mtrb = sd((percent.mt + percent.rb), na.rm = TRUE))
#this lets it print out all the samples and their information.
as.data.frame(hbca.cluster.summary)
## SCT_snn_res.1.2 mean_count sd_count mean_feature sd_feature mean_mt
## 1 0 6033.2553 1940.6358 1471.0738 324.3620 4.729693
## 2 1 3926.2951 2254.8642 1290.7648 572.1410 10.228064
## 3 2 4669.5187 1972.5210 1493.9061 515.0678 8.296595
## 4 3 7484.5311 2376.2919 1919.9018 435.8662 9.591496
## 5 4 2666.6196 1786.6718 925.7418 472.7027 10.900661
## 6 5 2932.9639 1211.6956 1091.6803 287.5460 7.014291
## 7 6 4086.9396 2260.0145 1195.8621 496.8800 9.419994
## 8 7 5566.5789 2299.2104 1675.0275 531.1316 8.863821
## 9 8 5315.6231 2585.0834 1685.4579 568.6595 7.382715
## 10 9 5056.9523 2876.3921 1576.0928 716.8003 12.662055
## 11 10 5164.8217 3771.8189 1338.4347 742.8546 8.328130
## 12 11 1183.1318 1068.0836 557.0847 297.1711 7.334250
## 13 12 6259.7750 2137.0356 1818.4484 458.5992 7.505742
## 14 13 4518.7659 1774.4239 1369.2273 347.6401 4.842095
## 15 14 6262.4010 2133.5258 1604.1878 408.6763 7.586661
## 16 15 1311.1242 918.1463 525.8442 252.3088 23.661653
## 17 16 3452.9824 3255.9250 923.3300 597.5610 15.053233
## 18 17 1100.5349 1041.9698 412.7660 243.7059 6.858958
## 19 18 4783.4241 2355.9566 1443.1269 562.9797 9.472183
## 20 19 1229.9250 1008.4610 484.4241 253.0290 27.466915
## 21 20 1529.9096 847.7010 440.1704 201.4161 25.747381
## 22 21 782.0244 632.0399 464.7073 139.6666 2.733713
## 23 22 3344.0861 2349.0450 1073.2398 584.5021 15.167215
## 24 23 3393.2017 1305.3385 1259.2536 332.4046 8.321122
## 25 24 4177.3568 2355.7119 1262.7357 567.9310 7.388694
## 26 25 9203.4588 5553.3197 1335.1648 597.7381 4.559564
## 27 26 3700.8532 1883.3573 1205.2239 466.5853 10.739742
## 28 27 6493.2791 2998.5836 1699.5736 591.4733 12.025459
## 29 28 4371.1306 2522.3129 1478.2789 625.1258 6.660825
## 30 29 5142.1184 3392.6318 1437.1645 766.0888 7.895036
## 31 30 2530.8941 1972.7195 971.3178 525.2602 10.076687
## 32 31 4320.8201 2290.0274 1098.7986 425.0446 7.158834
## sd_mt mean_rb sd_rb mean_mtrb sd_mtrb
## 1 1.970791 44.653494 6.250794 49.38319 5.843932
## 2 7.307552 23.581962 9.703908 33.81003 9.448785
## 3 4.315238 26.001006 6.792560 34.29760 7.459120
## 4 3.722353 24.570327 4.704891 34.16182 4.871564
## 5 8.843798 29.239150 11.019589 40.13981 9.089828
## 6 4.323457 28.470803 7.448514 35.48509 6.039713
## 7 6.343405 23.816322 9.213433 33.23632 8.041491
## 8 4.213536 25.280822 5.827483 34.14464 5.955860
## 9 4.636946 18.339165 7.338438 25.72188 7.373823
## 10 9.495714 18.012983 8.465894 30.67504 8.264189
## 11 7.594764 12.088936 8.064816 20.41707 9.030351
## 12 8.437952 24.482260 11.251332 31.81651 9.677675
## 13 4.185512 22.987708 4.660961 30.49345 5.324920
## 14 2.462793 33.440684 7.001164 38.28278 6.295909
## 15 3.513778 38.456694 8.245851 46.04336 6.831848
## 16 15.646704 11.013411 10.655755 34.67506 15.137684
## 17 16.919603 31.141609 18.710249 46.19484 14.531341
## 18 8.891269 43.655067 14.711844 50.51403 12.481494
## 19 4.538652 23.554494 7.850287 33.02668 9.113335
## 20 24.888493 6.808604 6.056845 34.27552 25.783622
## 21 9.593057 24.205138 5.236502 49.95252 8.848097
## 22 2.876071 27.496262 7.374202 30.22997 7.235492
## 23 13.951469 24.459261 10.864352 39.62648 13.227447
## 24 3.465236 25.145386 6.561584 33.46651 6.278283
## 25 5.687810 22.330142 7.821906 29.71884 7.814280
## 26 5.761251 12.711056 6.106016 17.27062 7.732269
## 27 5.256139 25.917623 8.545189 36.65737 7.866378
## 28 6.417580 21.125274 6.262248 33.15073 6.823157
## 29 4.799713 20.156497 7.543439 26.81732 8.483466
## 30 4.308034 27.516105 5.732408 35.41114 5.544343
## 31 7.822976 17.618741 7.527027 27.69543 8.261940
## 32 7.293370 33.162322 16.689333 40.32116 13.861287
#total for all counts
hbca.total.summary <- hbca.harmony@meta.data %>%
summarise(mean_count = mean(nCount_RNA, na.rm = TRUE),
sd_count = sd(nCount_RNA, na.rm = TRUE),
mean_feature = mean(nFeature_RNA, na.rm = TRUE),
sd_feature = sd(nFeature_RNA, na.rm = TRUE),
mean_mt = mean(percent.mt, na.rm = TRUE),
sd_mt = sd(percent.mt, na.rm = TRUE),
mean_rb = mean(percent.rb, na.rm = TRUE),
sd_rb = sd(percent.rb, na.rm = TRUE),
mean_mtrb = mean((percent.mt + percent.rb), na.rm = TRUE),
sd_mtrb = sd((percent.mt + percent.rb), na.rm = TRUE))
as.data.frame(hbca.total.summary)
## mean_count sd_count mean_feature sd_feature mean_mt sd_mt mean_rb
## 1 4390.262 2887.998 1292.081 631.3039 9.516652 8.976824 25.88226
## sd_rb mean_mtrb sd_mtrb
## 1 12.08002 35.39892 11.5616
# mito cluster cutoff
mito.cluster.cutoff <- hbca.total.summary$mean_mt + 2*hbca.total.summary$sd_mt
mito.cluster.cutoff
## [1] 27.4703
# ribo cluster cutoff
ribo.cluster.cutoff <- hbca.total.summary$mean_rb + 2*hbca.total.summary$sd_rb
ribo.cluster.cutoff
## [1] 50.04231
# combined mito & ribo cluster cutoff
combined.cluster.cutoff <- hbca.total.summary$mean_mtrb + 2*hbca.total.summary$sd_mtrb
combined.cluster.cutoff
## [1] 58.52212
Without formal testing, you might consider dropping clusters whose mean is above these cutoffs.
low percent.mt - suggests high quality; high percent.mt - suggests dead/dying, stressed, or highly metabolically active cells
VlnPlot(hbca.harmony, features = c("percent.mt"))
Visually inspecting the clusters, you might note that:
low percent.rb - suggests high quality; high percent.rb - suggests dead/dying cells
VlnPlot(hbca.harmony, features = c("percent.rb"))
Using these cutoffs without doing formal statistical tests would remove the following clusters.
sapply(hbca.cluster.summary$mean_mt, function(x) {x > mito.cluster.cutoff})
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
sapply(hbca.cluster.summary$mean_rb, function(x) {x > ribo.cluster.cutoff})
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
sapply(hbca.cluster.summary$mean_mtrb, function(x) {x > combined.cluster.cutoff})
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
So, based on this, we should drop the following:
dropclusters <- hbca.cluster.summary$SCT_snn_res.1.2[sapply(hbca.cluster.summary$mean_mt, function(x) {x > mito.cluster.cutoff}) | sapply(hbca.cluster.summary$mean_rb, function(x) {x > ribo.cluster.cutoff}) | sapply(hbca.cluster.summary$mean_mtrb, function(x) {x > combined.cluster.cutoff})]
Based on this we will drop cluster(s) .
This matches more or less what we saw visually, though it is possibly a little more conservative.
We also want to look at the other metrics that might be important.
Here we want to consider the clusters that are very low.
VlnPlot(hbca.harmony, features = c("nFeature_RNA"))
VlnPlot(hbca.harmony, features = c("nCount_RNA"))
One thing to consider is that immune cells might also have a low number of features. To this end, be careful about removing cells or clusters based solely on the number of features.
Note that in the example below I set the required number of features to be lower than what was required in the basic example tutorial.
feature.cluster.cutoff<- hbca.total.summary$mean_feature - 2*hbca.total.summary$sd_feature
feature.cluster.cutoff
## [1] 29.47317
counts.cluster.cutoff<- hbca.total.summary$mean_count - 2*hbca.total.summary$sd_count
counts.cluster.cutoff
## [1] -1385.734
While it is not logical, we can test even a negative cutoff without an out-of-bounds check. However, to follow coding best practices, it is smart to put in if/else or try/catch code for items like this. These checks provide a logical sanity check and prevent future errors downstream if nonsense numbers are passed.
In this case, both cutoffs are too low to be useful, but you can still check against them using less-than comparisons.
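A minimal sketch of such a sanity check (the function name and the floor of 0 are my own choices, not from this tutorial):

```r
# Clamp a lower QC cutoff to a sensible floor; counts and features
# can never be negative, so a negative cutoff is a nonsense number.
safe_lower_cutoff <- function(cutoff, floor_value = 0) {
  if (!is.finite(cutoff)) {
    stop("cutoff must be a finite number")
  }
  if (cutoff < floor_value) {
    warning("cutoff ", cutoff, " is below ", floor_value, "; clamping")
    return(floor_value)
  }
  cutoff
}

safe_lower_cutoff(-1385.734)  # returns 0 with a warning
safe_lower_cutoff(250)        # returns 250 unchanged
```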
# add to drop clusters, only the unique clusters that should also be dropped
# these cut offs would be based on lowest.
dropclusters<- unique(c(dropclusters, hbca.cluster.summary$SCT_snn_res.1.2[sapply(hbca.cluster.summary$mean_feature, function(x) {x < feature.cluster.cutoff}) | sapply(hbca.cluster.summary$mean_count, function(x) {x < counts.cluster.cutoff})]))
Now we can remove our marked clusters. You can filter out cells by marking a cluster as an outlier, but first you need to add "outlier" to the factor levels.
This is usually fairly interactive, but note that it must be recorded and reported when you are writing your methods. This is one of the parts people most often mess up: they do not record how they filtered or dropped "bad" clusters, when they dropped them, how they decided they should be dropped, and so forth.
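One lightweight way to keep that record is a small log table written alongside the analysis. Everything below (the column names, the example cluster IDs, and the reasons) is illustrative, not taken from the results above:

```r
# Record which clusters were dropped, why, and under what cutoff rule,
# so the methods section can be written straight from this file.
filter_log <- data.frame(
  resolution = 1.2,
  cluster    = c("15", "19"),  # hypothetical cluster IDs
  reason     = c("high percent.mt", "high percent.mt + percent.rb"),
  rule       = "cluster mean > overall mean + 2*sd",
  date       = as.character(Sys.Date())
)
write.csv(filter_log, "cluster_filter_log.csv", row.names = FALSE)
```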
levels(hbca.harmony@meta.data$SCT_snn_res.1.2) <- c(levels(hbca.harmony@meta.data$SCT_snn_res.1.2), "outlier")
if(length(dropclusters) >= 1 )
{
for(clust.num in dropclusters)
{
hbca.harmony@meta.data$SCT_snn_res.1.2[which(hbca.harmony@meta.data$SCT_snn_res.1.2 == clust.num)] <- "outlier"
}
}
hbca.harmony.cf <- subset(hbca.harmony, subset = SCT_snn_res.1.2 != "outlier")
Now that we have removed the cluster(s), we are going to apply the mito/ribo filters that were previously used by the paper.
Note that the reason you do this "second" is that the clustering was meant to pull out the bad cells based on them grouping together. Cluster clean-up can be a fairly dangerous undertaking, as it might remove clusters of interest that are highly stressed due to treatment or nature. However, you need to know that people do this.
After that step, we want to consider removing cells using more blanket filters, as was done during the first single-sample examination.
hbca.harmony.filtered <- subset(hbca.harmony.cf, subset = nFeature_RNA > min_nFeature &
nFeature_RNA < max_nFeature &
nCount_RNA > min_nCount &
nCount_RNA < max_nCount &
percent.mt < max_mito &
percent.rb < max_ribo)
After you have filtered and cleaned up, it is good to “recluster”.
Sys.time()
## [1] "2024-03-25 19:28:56 CDT"
hbca.harmony.filtered <- RunHarmony(hbca.harmony.filtered, group.by.vars='Digestion', assay.use='SCT')
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony converged after 2 iterations
## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.SCT.harmony; see ?make.names for more details
## on syntax validity
hbca.harmony.filtered <- RunUMAP(hbca.harmony.filtered, reduction = "harmony", dims = 1:paper_dim, reduction.name = 'filteredharmonyumap', reduction.key = 'filteredharmonyumap_')
## 19:29:09 UMAP embedding parameters a = 0.9922 b = 1.112
## 19:29:09 Read 23148 rows and found 30 numeric columns
## 19:29:09 Using Annoy for neighbor search, n_neighbors = 30
## 19:29:09 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 19:29:12 Writing NN index file to temp file /var/folders/3_/lw61drbn68xglctt7ljrsmrcd_ghr5/T//Rtmpsdg3ET/file93066f78de83
## 19:29:12 Searching Annoy index using 1 thread, search_k = 3000
## 19:29:17 Annoy recall = 100%
## 19:29:17 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:29:18 Initializing from normalized Laplacian + noise (using irlba)
## 19:29:20 Commencing optimization for 200 epochs, with 992604 positive edges
## 19:29:32 Optimization finished
hbca.harmony.filtered <- FindNeighbors(hbca.harmony.filtered, reduction = "harmony", dims = 1:paper_dim)
## Computing nearest neighbor graph
## Computing SNN
hbca.harmony.filtered <- FindClusters(hbca.harmony.filtered, resolution = paper_res)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 23148
## Number of edges: 826143
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9663
## Number of communities: 14
## Elapsed time: 3 seconds
hbca.harmony.filtered
## An object of class Seurat
## 56777 features across 23148 samples within 2 assays
## Active assay: SCT (23239 features, 3000 variable features)
## 1 other assay present: RNA
## 5 dimensional reductions calculated: pca, umap, harmony, harmonyumap, filteredharmonyumap
Sys.time()
## [1] "2024-03-25 19:29:40 CDT"
hbca.harmony.filtered <- FindClusters(hbca.harmony.filtered, resolution = paper_res)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 23148
## Number of edges: 826143
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9663
## Number of communities: 14
## Elapsed time: 3 seconds
harmony_filter_clusters_umap <- DimPlot(hbca.harmony.filtered, reduction = "filteredharmonyumap", group.by = "seurat_clusters")
harmony_filter_clusters_umap
Taking the Harmony-integrated data, what might you name the different clusters?
How did you arrive at those names?
hbca.harmony.filtered <- FindClusters(hbca.harmony.filtered, resolution = 0.1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 23148
## Number of edges: 826143
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9800
## Number of communities: 12
## Elapsed time: 3 seconds
UMAP of clusters
harmony_clusters_umap <- DimPlot(hbca.harmony.filtered, reduction = "filteredharmonyumap", group.by = "seurat_clusters")
harmony_clusters_umap
Heatmap plots of clusters. Note that I am using the RNA assay, not the integrated assay, so that we are seeing what differs across the clusters based on the unchanged values.
hbca.markers <- FindAllMarkers(hbca.harmony.filtered, assay="RNA", only.pos = TRUE)
## Calculating cluster 0
## For a more efficient implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the limma package
## --------------------------------------------
## install.packages('BiocManager')
## BiocManager::install('limma')
## --------------------------------------------
## After installation of limma, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
# you can change the fold change requirements
hbca.markers.top <- hbca.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 0.8)
hbca.markers.top <- as.data.frame(hbca.markers.top)
#here we look at the just top 5 from the selected set
top.genes <- NULL
for(cluster.num in unique(hbca.markers.top$cluster))
{
my_genes <- hbca.markers.top$gene[which(hbca.markers.top$cluster == as.numeric(cluster.num))[1:5]]
# for the ones that have less than 5 that make it through
if(any(is.na(my_genes)))
{
my_genes <- my_genes[-which(is.na(my_genes))]
}
top.genes <- c(top.genes, my_genes)
}
DoHeatmap(subset(hbca.harmony.filtered, downsample = 100), features = top.genes, size = 3)
If you don’t want to look at the top genes based on the clusters, you can also look at canonical genes.
my_Guided_Markers_path <- paste0(your_working_dir,"/input/Guided_Markers.xlsx")
# read file in
user_markers <- readxl::read_excel(my_Guided_Markers_path)
user_markers <- as.data.frame(user_markers)
# select markers from the HBCA paper
my_marker_genes <- user_markers$Gene[user_markers$Sources=="HBCA_paper"]
DoHeatmap(subset(hbca.harmony.filtered, downsample = 100), features = my_marker_genes, size = 3)
You can also consider canonical markers or using a label transfer.
For further reading, I suggest reading about SingleR: https://github.com/dviraran/SingleR There are also YouTube videos that walk through much of this.
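A hedged sketch of what a SingleR label transfer might look like on this object (not run in this tutorial; it assumes the Bioconductor packages SingleR and celldex are installed, and uses the Human Primary Cell Atlas as an example reference):

```r
library(SingleR)
library(celldex)

# Example reference atlas; any labeled reference appropriate to breast
# tissue could be substituted here.
ref <- celldex::HumanPrimaryCellAtlasData()

# Transfer labels using the normalized RNA counts from the Seurat object.
pred <- SingleR(test   = GetAssayData(hbca.harmony.filtered, assay = "RNA", slot = "data"),
                ref    = ref,
                labels = ref$label.main)

# Store the predicted labels back in the metadata and summarize.
hbca.harmony.filtered$SingleR.labels <- pred$labels
table(hbca.harmony.filtered$SingleR.labels)
```

Always sanity-check transferred labels against the canonical markers above; automated labels are a starting point, not a final answer.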
For reproducible research, one thing to keep in mind is that it is always good to track which packages you used so someone else can reproduce your work. Therefore, a good practice is to always output this at the end of your session so you know what you did and what you had.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] harmony_1.2.0 Rcpp_1.0.11 readxl_1.4.3 dplyr_1.1.3
## [5] patchwork_1.1.3 ggplot2_3.4.4 sp_1.5-0 SeuratObject_4.1.2
## [9] Seurat_4.2.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.16 colorspace_2.1-0 deldir_1.0-6
## [4] ellipsis_0.3.2 ggridges_0.5.4 rstudioapi_0.14
## [7] spatstat.data_2.2-0 farver_2.1.1 leiden_0.4.3
## [10] listenv_0.8.0 bit64_4.0.5 ggrepel_0.9.1
## [13] fansi_1.0.5 codetools_0.2-18 splines_4.2.0
## [16] cachem_1.0.8 knitr_1.44 polyclip_1.10-0
## [19] jsonlite_1.8.7 RhpcBLASctl_0.23-42 ica_1.0-3
## [22] cluster_2.1.3 png_0.1-8 rgeos_0.5-9
## [25] uwot_0.1.16 shiny_1.7.5.1 sctransform_0.3.5
## [28] spatstat.sparse_2.1-1 compiler_4.2.0 httr_1.4.7
## [31] Matrix_1.5-1 fastmap_1.1.1 lazyeval_0.2.2
## [34] cli_3.6.1 later_1.3.1 htmltools_0.5.6.1
## [37] tools_4.2.0 igraph_1.5.1 gtable_0.3.4
## [40] glue_1.6.2 RANN_2.6.1 reshape2_1.4.4
## [43] scattermore_0.8 cellranger_1.1.0 jquerylib_0.1.4
## [46] vctrs_0.6.4 nlme_3.1-158 progressr_0.11.0
## [49] lmtest_0.9-40 spatstat.random_2.2-0 xfun_0.40
## [52] stringr_1.5.0 globals_0.16.1 mime_0.12
## [55] miniUI_0.1.1.1 lifecycle_1.0.3 irlba_2.3.5.1
## [58] goftest_1.2-3 future_1.28.0 MASS_7.3-57
## [61] zoo_1.8-11 scales_1.2.1 spatstat.core_2.4-4
## [64] promises_1.2.1 spatstat.utils_2.3-1 parallel_4.2.0
## [67] RColorBrewer_1.1-3 yaml_2.3.7 reticulate_1.26
## [70] pbapply_1.5-0 gridExtra_2.3 sass_0.4.7
## [73] rpart_4.1.16 stringi_1.7.12 rlang_1.1.1
## [76] pkgconfig_2.0.3 matrixStats_1.0.0 evaluate_0.22
## [79] lattice_0.20-45 ROCR_1.0-11 purrr_1.0.2
## [82] tensor_1.5 labeling_0.4.3 htmlwidgets_1.6.2
## [85] bit_4.0.4 cowplot_1.1.1 tidyselect_1.2.0
## [88] parallelly_1.32.1 RcppAnnoy_0.0.21 plyr_1.8.7
## [91] magrittr_2.0.3 R6_2.5.1 generics_0.1.3
## [94] DBI_1.1.3 withr_2.5.1 mgcv_1.8-40
## [97] pillar_1.9.0 fitdistrplus_1.1-8 survival_3.3-1
## [100] abind_1.4-5 tibble_3.2.1 future.apply_1.9.1
## [103] hdf5r_1.3.6 KernSmooth_2.23-20 utf8_1.2.3
## [106] spatstat.geom_2.4-0 plotly_4.10.2 rmarkdown_2.25
## [109] grid_4.2.0 data.table_1.14.8 digest_0.6.33
## [112] xtable_1.8-4 tidyr_1.3.0 httpuv_1.6.11
## [115] munsell_0.5.0 viridisLite_0.4.2 bslib_0.5.1